Capstone Project - The Battle of the Neighborhoods (Week 1)

Applied Data Science Capstone by IBM/Coursera

Fernando Tauscheck

Table of contents:

  1. Introduction: Business Problem
    1.1 Curitiba
  2. Data
    2.1 Start the code
    2.2 Foursquare
    2.3 Geographic Data
  3. Methodology
  4. Analysis

1. Introduction: Business Problem

What defines the success of a commercial business? Can we predict if a point is good enough to open a profitable bakery?

Although the analysis can, in theory, be replicated for any type of business, this report is targeted at stakeholders interested in opening a bakery in Curitiba, Brazil. We will use geographic and socioeconomic data from existing bakeries to define a list of possible locations for a new bakery.

1.1 Curitiba:

Curitiba is the capital and largest city of the Brazilian state of Paraná. Its population was 1,948,626 as of 2020, making it the eighth-most populous city in Brazil and the largest in the country's South Region. According to Foursquare, Curitiba has 608 bakeries, broken down by classification as follows:

[figure: classification breakdown of Curitiba's bakeries]

2. Data:

Several factors will influence our analysis.

As the data aggregation tool, the RDBMS PostgreSQL will be used with the PostGIS extension and its spatial analysis functions.

2.1 Start the code:

2.2 Foursquare:

This project uses the Foursquare API as its main data source, as it has a database of millions of venues. To restrict the number of venues requested from the Foursquare API, only places classified as bakeries were retrieved. To mitigate the problem of areas containing more than 100 bakeries (an API limit per request), we query the API over a grid of hexagons with a 600 m radius. The coordinates of these hexagons were generated in code, starting from a central point in Curitiba, and all points were validated as lying 'within' the Curitiba area through a SQL query. The coordinates of the central point were obtained from a request to the Google Geocode API using the 'Fanny' neighborhood as the parameter. With the venue list in hand, an additional request was made to retrieve the details of each venue:

2.2.1 Retrieve Curitiba Coordinates - Google Geocode:

Starting from a geographically central point in Curitiba (not necessarily in the downtown area), we use the Google Geocode API to obtain the coordinates of this point. These coordinates will be used as the starting point for defining the collection and analysis points, and as the center point of the maps used in this report.

The "Fanny" neighborhood will be the starting point.
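The geocoding lookup can be sketched as below. Only the request URL is built here (no network call is made), and `API_KEY` is a placeholder, not a real credential:

```python
from urllib.parse import urlencode

# Sketch of the Google Geocoding API request for the 'Fanny' neighborhood.
# API_KEY is a placeholder; the response path shown is the standard
# Geocoding API JSON layout.
API_KEY = "YOUR_API_KEY"
params = {"address": "Fanny, Curitiba, Brazil", "key": API_KEY}
url = "https://maps.googleapis.com/maps/api/geocode/json?" + urlencode(params)
print(url)
# The coordinates come back in the JSON response at
# response["results"][0]["geometry"]["location"] -> {"lat": ..., "lng": ...}
```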

2.2.2 Calculating reference points to request venues from the Foursquare API:

With the central coordinates defined in the previous function, equidistant points (vertices of hexagons) will be generated covering the entire area of the municipality. Starting from these points, the Foursquare API will be queried with a calculated radius.
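A minimal sketch of how such a grid of points could be generated. The spacing math (flat-top hexagon layout), the equirectangular meters-to-degrees conversion, and the Curitiba center coordinates below are illustrative assumptions, not the project's exact code:

```python
import math

def hex_grid(center_lat, center_lon, radius_m=600, rings=10):
    """Generate center coordinates of a flat-top hexagon grid around a
    starting point. Distances use a simple equirectangular
    approximation, adequate at city scale."""
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(center_lat))
    dx = 1.5 * radius_m                 # column spacing in meters
    dy = math.sqrt(3) * radius_m        # row spacing in meters
    points = []
    for col in range(-rings, rings + 1):
        for row in range(-rings, rings + 1):
            y = row * dy + (dy / 2 if col % 2 else 0)  # offset odd columns
            x = col * dx
            points.append((center_lat + y / m_per_deg_lat,
                           center_lon + x / m_per_deg_lon))
    return points

# Approximate Curitiba coordinates (assumption for illustration)
grid = hex_grid(-25.50, -49.25)
print(len(grid))  # (2*rings + 1)**2 = 441 points, filtered later against the city border
```

In the project, each generated point would then be tested against the municipality polygon (the 'within' SQL validation mentioned above) before being used.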

2.2.3 Requesting Venues (and details) from the Foursquare API:

With the points calculated in the previous function, the Foursquare API is called. As the query radius, we used the hexagon's circumradius (center-to-vertex distance) plus a 1% margin.

The output of this code is stored directly in a PostgreSQL table.
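One such request can be sketched as below. Only the URL is built (no call is made); `CLIENT_ID`/`CLIENT_SECRET` are placeholders, and the bakery `categoryId` value is an assumption for illustration:

```python
from urllib.parse import urlencode

# Sketch of a Foursquare v2 venue search around one grid point.
# CLIENT_ID / CLIENT_SECRET are placeholders; BAKERY_CATEGORY is the
# category filter mentioned in the text (value assumed for illustration).
HEX_RADIUS_M = 600
BAKERY_CATEGORY = "4bf58dd8d48988d16a941735"

params = {
    "client_id": "CLIENT_ID",
    "client_secret": "CLIENT_SECRET",
    "v": "20200601",                     # API version date
    "ll": "-25.50,-49.25",               # one grid point (lat,lng)
    "radius": int(HEX_RADIUS_M * 1.01),  # circumradius + 1% margin
    "categoryId": BAKERY_CATEGORY,
    "limit": 100,                        # API cap per request
}
url = "https://api.foursquare.com/v2/venues/search?" + urlencode(params)
print(url)
```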

2.2.3.2 Dataframe of Venues:

2.3 Geographic Data:

We will get geographic information about Curitiba from the website of the "Instituto de Pesquisa e Planejamento Urbano de Curitiba" (Institute of Urban Research and Planning of Curitiba, also known as IPPUC). The Institute provides all sorts of maps of Curitiba. We will use:

These maps are provided in SHP format (ESRI). They were subsequently converted to GeoJSON in a proper representation (WGS84). The GeoJSON files were loaded into an RDBMS (PostgreSQL), where the PostGIS extension is used for the analysis. The structure of the tables (SQL file) can be found in this project's GitHub repository.

2.3.1 Loading Neighborhoods GeoJSON to Database:

Load file support/GeoJSON/Curitiba_neighbourhood.geojson into a PostgreSQL table.
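A sketch of how one GeoJSON feature could be inserted into a PostGIS table. The table and column names (`neighborhoods`, `name`, `geom`) and the toy geometry are assumptions; only the parameterized SQL is built here, and with `psycopg2` it would be executed as `cursor.execute(sql, args)`:

```python
import json

# One toy GeoJSON feature standing in for a record of
# Curitiba_neighbourhood.geojson (names and coordinates are made up).
feature = {
    "type": "Feature",
    "properties": {"name": "Fanny"},
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-49.27, -25.49], [-49.26, -25.49],
                                  [-49.26, -25.50], [-49.27, -25.49]]]},
}

# PostGIS parses the geometry and tags it with the WGS84 SRID (4326).
sql = ("INSERT INTO neighborhoods (name, geom) "
       "VALUES (%s, ST_SetSRID(ST_GeomFromGeoJSON(%s), 4326));")
args = (feature["properties"]["name"], json.dumps(feature["geometry"]))
print(sql)
```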

2.3.1.2 Plotting Neighborhoods:

2.3.2 Loading Master Plan GeoJSON to Database:

Load file support/GeoJSON/Curitiba_master_plan.geojson into a PostgreSQL table.

2.3.2.2 Plotting Master Plan:

2.3.3 Loading Main Streets GeoJSON to Database:

Load file support/GeoJSON/Curitiba_main_streets.geojson into a PostgreSQL table.

2.3.4 Loading Extra Areas GeoJSON to Database:

Load files with extra areas into a PostgreSQL table:

2.3.5 Socioeconomic data of the neighborhoods:

The socioeconomic data of the municipality was collected from the Wikipedia article "Lista de bairros de Curitiba" (List of neighborhoods of Curitiba).

3 Methodology:

The objective of this project is to find regions in Curitiba with the best conditions for opening a high-income bakery.

In a first step, we collect all relevant data: geographic data provided by the city of Curitiba (through IPPUC), socioeconomic data (collected from Wikipedia), and the location and classification of existing bakeries, obtained through the Foursquare API. All data were loaded into tables in a PostgreSQL database.

In a second step, the data is explored. For this purpose, the city is divided into hexagons with a radius of 300 m. For each of these 'areas', geographic and socioeconomic attributes are aggregated, allowing the application of the k-means clustering algorithm.

Only in the third stage of the project will data from current establishments be added to the study. With this, we will be able to define the regions with the greatest potential and classify them for our investors.

4 Analysis:

4.1 Creating hexagon to study:

To start, let's create the hexagons that will be the basis of the study.
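The polygon of each study hexagon can be built from its center; a sketch under the same equirectangular approximation used for the grid (the function name and 300 m default mirror the text, everything else is illustrative):

```python
import math

def hexagon_vertices(lat, lon, radius_m=300):
    """Return the vertices of a flat-top hexagon around a center point
    as (lon, lat) pairs, repeating the first vertex to close the ring
    (GeoJSON polygon convention). A sketch using an equirectangular
    approximation."""
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat))
    ring = []
    for k in range(7):                  # 6 vertices + repeated first vertex
        ang = math.radians(60 * k)
        ring.append((lon + radius_m * math.cos(ang) / m_per_deg_lon,
                     lat + radius_m * math.sin(ang) / m_per_deg_lat))
    return ring

poly = hexagon_vertices(-25.50, -49.25)
print(len(poly))  # 7 points: a closed ring
```

Alternatively, recent PostGIS versions can generate the whole grid server-side with `ST_HexagonGrid`, which would replace this client-side computation.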

4.2 Hexagons vs Geographic Data:

Using PostGIS to fill a pandas dataframe.

4.2.1 Socioeconomic data of each Hexagon:

In the database table, we have a column with the GeoJSON object of each hexagon. We have similar information in the neighborhood table, with the borders of each neighborhood. Using PostGIS functions and the socioeconomic data extracted from Wikipedia, we calculate the area of each neighborhood overlapping each hexagon. Applying each neighborhood's population and income proportionally to the overlapping area (relative to the neighborhood's total area), we obtain these figures for each hexagon.
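The proportional allocation can be sketched as follows. In the project the overlap areas would come from PostGIS (e.g. `ST_Area(ST_Intersection(...))`); the numbers below are made up for illustration:

```python
# A hexagon that overlaps several neighborhoods receives population in
# proportion to the overlapped share of each neighborhood's area.
neighborhoods = {
    # name: (total_area_km2, population, avg_income) -- toy values
    "A": (2.0, 10_000, 3_000.0),
    "B": (4.0, 20_000, 5_000.0),
}
# km2 of each neighborhood falling inside this hexagon (toy values)
overlaps = {"A": 0.5, "B": 1.0}

hex_pop = sum(pop * overlaps[n] / area
              for n, (area, pop, _) in neighborhoods.items())
print(hex_pop)  # 10000*(0.5/2) + 20000*(1/4) = 2500.0 + 5000.0 = 7500.0
```

Income can be allocated with the same area weights.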

To reduce the number of hexagons in the study and focus on the regions with greater purchasing power, we keep only the hexagons that together account for 85% of the municipality's income.
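This cut can be sketched as a cumulative-sum filter over hexagons sorted by income (toy numbers, standing in for the real per-hexagon values):

```python
# Keep the richest hexagons until 85% of the total income is covered.
incomes = {"h1": 500.0, "h2": 300.0, "h3": 150.0, "h4": 50.0}
target = 0.85 * sum(incomes.values())   # 850.0 out of 1000.0

kept, running = [], 0.0
for hex_id, inc in sorted(incomes.items(), key=lambda kv: kv[1], reverse=True):
    if running >= target:
        break
    kept.append(hex_id)
    running += inc
print(kept)  # ['h1', 'h2', 'h3'] -- 950.0 of the 850.0 target covered
```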

4.2.2 Master Plan of each Hexagon:

Applying the same logic as in the previous block, we calculate the overlapping area of each type of zone in the Master Plan.

4.2.3 Main Streets of each Hexagon:

Applying the same logic, we calculate the length of each type of main street in each hexagon.

4.2.4 Extras of each Hexagon:

Applying the same logic, we calculate the overlapping area of each type of 'extra' area in Curitiba.

4.3 Processing data:

As a first action, let's put all data in the same dataframe.

4.3.1 Normalizing:
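Since the features (areas, lengths, population, income) live on very different scales, they must be rescaled before clustering. The exact scaler is not stated here, so as an assumed example, min-max scaling to [0, 1]:

```python
# Min-max normalization sketch: rescale one feature column to [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max([10.0, 20.0, 40.0])
print(scaled)  # [0.0, 0.333..., 1.0]
```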

4.3.2 Clustering:

Using the 'KMeans' function of scikit-learn, we will cluster the study areas into 6 groups.
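The clustering step looks like the sketch below; random toy data stands in for the real normalized hexagon features, and `n_init`/`random_state` are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the normalized feature table: 100 hexagons x 5 features.
rng = np.random.default_rng(0)
features = rng.random((100, 5))

# Cluster the study areas into 6 groups, as in the text.
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(features)
print(sorted(set(km.labels_.tolist())))  # six cluster labels, 0 through 5
```

Each hexagon then carries its cluster label through the rest of the analysis.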

4.4 Bringing Foursquare data into the analysis:

Up to this point in the study, no information from Foursquare has been used, as the aim so far was to classify the points based on their geographic, socioeconomic and legal characteristics. To evolve the analysis, let's add information about the establishments we collected from Foursquare. We will not treat the establishments individually, but aggregate them over the hexagons in the study.

4.4.1 Venues per Hexagon:

4.4.2 Venue distance to Hexagon:

4.5 Defining the relevant clusters:

With the information of Bakeries per hexagon and the cluster of each hexagon, we can select the clusters with the highest averages of bakeries per hexagon.
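This selection can be sketched with plain Python over per-hexagon counts (toy data; in the project these come from the database tables):

```python
# Rank clusters by their mean number of bakeries per hexagon and keep
# those above the relevance threshold used later in the text (0.25).
hexagons = [
    # (cluster_label, bakeries_in_hexagon) -- toy values
    (0, 1), (0, 0), (2, 2), (2, 1), (4, 1), (4, 0), (1, 0), (1, 0),
]

totals, counts = {}, {}
for cluster, bakeries in hexagons:
    totals[cluster] = totals.get(cluster, 0) + bakeries
    counts[cluster] = counts.get(cluster, 0) + 1

means = {c: totals[c] / counts[c] for c in totals}
relevant = sorted((c for c, m in means.items() if m > 0.25),
                  key=lambda c: means[c], reverse=True)
print(relevant)  # [2, 0, 4] -- clusters ordered by mean bakeries per hexagon
```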

4.6 A peek at the data:

Based on the classification of the clusters, we will analyze the clusters with an average of bakeries per hexagon greater than 0.25. In our case, these are clusters 4, 2 and 0.

As we can see on the map above, if we considered only the distance between a hexagon and the closest high-income bakery, the choice would fall on the outermost points (greatest distance). But we have more information that can refine this choice:

4.6.1 Hexagons per Population and per Income:

TO DO: finish the analysis.